@Askell et al.

mentions: 1, type: Person

19:23

2026-06-04

lesswrong.com

large-language-models

(Mis)generalization of Helpful-Only Fine-tuning

Researchers studying helpful-only (H-only) large language models found that existing models exhibit emergent misalignment, residual refusal behaviors, poor steerability, sycophancy, and incoherent cha…

// co-occurs with (top 4 entities)

Anthropic (1), Bai et al. (1), Greenblatt et al. (1), Roger (1)